Fault Tolerant Algorithms for Heat Transfer Problems
Abstract
With the emergence of new massively parallel systems in High Performance Computing, which allow scientific simulations to run on thousands of processors, the mean time between failures of large machines is decreasing from several weeks to a few minutes. The ability of hardware and software components to handle these singular events, called process failures, is therefore becoming increasingly important. For a scientific code to continue despite a process failure, the application must be able to retrieve the lost data items. The recovery procedure after failures might be fairly straightforward for elliptic and linear hyperbolic problems. For parabolic problems, however, reversing the integration in time appears to be the most challenging part, because the backward problem is ill-posed. This paper focuses on new fault-tolerant numerical schemes for the time integration of parabolic problems. The new algorithm allows the application to recover from process failures and to reconstruct numerically the lost data of the failed process(es), avoiding the expensive roll-back operation required in most Checkpoint/Restart schemes. As a fault-tolerant communication library, we use the Fault Tolerant Message Passing Interface developed by the Innovative Computing Laboratory at the University of Tennessee. Experimental results show promising performance: the three-dimensional parabolic benchmark code is able to recover and keep running after failures, adding only a very small penalty to the overall execution time.

Index Terms: Parallel numerical algorithms, Process fault tolerance, Parabolic problems
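The ill-posedness of backward time integration mentioned in the abstract can be illustrated with a small sketch. It is not the paper's scheme: it assumes the one-dimensional model problem u_t = u_xx with unit diffusivity, for which the Fourier mode sin(kx) is scaled by exp(-k^2 t) over a time t. The function names `mode_factor` and `backward_error` are hypothetical and chosen only for this example.

```python
import math


def mode_factor(k: int, t: float) -> float:
    """Scaling of the mode sin(k*x) of u_t = u_xx after time t.

    t > 0 integrates forward (the mode decays);
    t < 0 integrates backward (the mode grows).
    Assumes unit diffusivity; illustrative only.
    """
    return math.exp(-k * k * t)


def backward_error(noise: float, k: int, t: float) -> float:
    """Amplitude of a perturbation `noise` in mode k after stepping back by t."""
    return noise * mode_factor(k, -t)


if __name__ == "__main__":
    # A round-off-sized perturbation in higher modes is amplified
    # enormously when the solution is integrated backward in time.
    for k in (1, 5, 10, 20):
        print(k, mode_factor(k, 0.1), backward_error(1e-10, k, 0.1))
```

Because the growth factor increases exponentially with k^2, even tiny high-frequency errors dominate a naively reversed solution, which is why recovering lost data by integrating backward is delicate for parabolic problems.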
Similar resources
Voting Algorithm Based on Adaptive Neuro Fuzzy Inference System for Fault Tolerant Systems
Some applications are critical and must be designed as fault-tolerant systems. A voting algorithm is usually one of the principal elements of a fault-tolerant system. Two kinds of voting algorithm are used in most applications: the majority voting algorithm and the weighted average algorithm. Both have problems. Majority confronts with the problem of threshold limits and voter of weight...
Reliability and Performance Evaluation of Fault-aware Routing Methods for Network-on-Chip Architectures (RESEARCH NOTE)
Nowadays, faults and failures are increasing especially in complex systems such as Network-on-Chip (NoC) based Systems-on-a-Chip due to the increasing susceptibility and decreasing feature sizes. On the other hand, fault-tolerant routing algorithms have an evident effect on tolerating permanent faults and improving the reliability of a Network-on-Chip based system. This paper presents reliabili...
On Feasibility of Adaptive Level Hardware Evolution for Emergent Fault Tolerant Communication
A permanent physical fault in communication lines usually leads to a failure. This paper studies the feasibility of evolving self-organized communication to overcome this problem. In this case a communication protocol may emerge between blocks and can also adapt itself to environmental changes such as physical faults and defects. In spite of faults, blocks may continue to function since ...
A Fault Tolerant Algorithms for the Minimization of Blocking Probability in Optical Burst Switching Network
One of the major concerns in the field of computer networks is how to transfer large amounts of data without congestion or faults. Optical Burst Switching networks are used today for large data transfers, so fault tolerance is an important issue in these networks. Fault tolerance refers to the ability of the network to transfer the information ...
Publication date: 2007